 molotov cocktail


How a fiery attack on Sam Altman's home unfolded

The Guardian

Sam Altman speaks during the BlackRock infrastructure summit on 11 March in Washington DC. Molotov cocktail attack on OpenAI CEO's home comes amid growing discontent against artificial intelligence. In the early hours of 10 April, a man approached the gate of OpenAI CEO Sam Altman's house in San Francisco and hurled a molotov cocktail at the building before fleeing. Federal and California state authorities have charged Moreno-Gama with a range of crimes including attempted arson and attempted murder. His parents issued a statement this week saying that their son had recently suffered a mental health crisis.


A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks

arXiv.org Artificial Intelligence

Recent research has demonstrated that state-of-the-art LLMs and defenses remain susceptible to multi-turn jailbreak attacks. These attacks require only closed-box model access and are often easy to perform manually, posing a significant threat to the safe and secure deployment of LLM-based systems. We study the effectiveness of the Crescendo multi-turn jailbreak at the level of intermediate model representations and find that safety-aligned LMs often represent Crescendo responses as more benign than harmful, especially as the number of conversation turns increases. Our analysis indicates that at each turn, Crescendo prompts tend to keep model outputs in a "benign" region of representation space, effectively tricking the model into fulfilling harmful requests. Further, our results help explain why single-turn jailbreak defenses like circuit breakers are generally ineffective against multi-turn attacks, motivating the development of mitigations that address this generalization gap.


Safeguarding AI Agents: Developing and Analyzing Safety Architectures

arXiv.org Artificial Intelligence

AI agents, specifically powered by large language models, have demonstrated exceptional capabilities in various applications where precision and efficacy are necessary. However, these agents come with inherent risks, including the potential for unsafe or biased actions, vulnerability to adversarial attacks, lack of transparency, and tendency to generate hallucinations. As AI agents become more prevalent in critical sectors of the industry, the implementation of effective safety protocols becomes increasingly important. This paper addresses the critical need for safety measures in AI systems, especially ones that collaborate with human teams. We propose and evaluate three frameworks to enhance safety protocols in AI agent systems: an LLM-powered input-output filter, a safety agent integrated within the system, and a hierarchical delegation-based system with embedded safety checks. Our methodology involves implementing these frameworks and testing them against a set of unsafe agentic use cases, providing a comprehensive evaluation of their effectiveness in mitigating risks associated with AI agent deployment. We conclude that these frameworks can significantly strengthen the safety and security of AI agent systems, minimizing potential harmful actions or outputs. Our work contributes to the ongoing effort to create safe and reliable AI applications, particularly in automated operations, and provides a foundation for developing robust guardrails to ensure the responsible use of AI agents in real-world applications.
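The first of the three frameworks, an input-output filter, can be pictured with a minimal sketch. Everything below is an assumption for illustration: the abstract gives no implementation, the names `filtered_call`, `keyword_judge`, `echo_agent`, and `REFUSAL` are hypothetical, and a trivial keyword check stands in for the LLM-powered judge the paper describes.

```python
from typing import Callable

REFUSAL = "Request blocked by safety filter."  # hypothetical refusal message


def keyword_judge(text: str) -> bool:
    """Stand-in safety judge: returns True when the text looks unsafe.

    The paper describes an LLM-powered judge; this keyword check is only a
    placeholder so the sketch runs without a model.
    """
    blocked_terms = ("delete all records", "disable logging")  # illustrative
    lowered = text.lower()
    return any(term in lowered for term in blocked_terms)


def filtered_call(
    agent: Callable[[str], str],
    prompt: str,
    is_unsafe: Callable[[str], bool] = keyword_judge,
) -> str:
    """Check the input, run the agent, then check the output."""
    if is_unsafe(prompt):
        return REFUSAL
    response = agent(prompt)
    if is_unsafe(response):
        return REFUSAL
    return response


def echo_agent(prompt: str) -> str:
    """Toy agent that repeats its prompt; a real agent would call an LLM."""
    return f"Agent result for: {prompt}"
```

Passing the judge as a parameter keeps the wrapper independent of how the safety decision is made, so the placeholder could be swapped for a model-backed check without changing `filtered_call`.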


Does Refusal Training in LLMs Generalize to the Past Tense?

arXiv.org Artificial Intelligence

Refusal training is widely used to prevent LLMs from generating harmful, undesirable, or illegal outputs. We reveal a curious generalization gap in the current refusal training approaches: simply reformulating a harmful request in the past tense (e.g., "How to make a Molotov cocktail?" to "How did people make a Molotov cocktail?") is often sufficient to jailbreak many state-of-the-art LLMs. We systematically evaluate this method on Llama-3 8B, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o, and R2D2 models using GPT-3.5 Turbo as a reformulation model. For example, the success rate of this simple attack on GPT-4o increases from 1% using direct requests to 88% using 20 past tense reformulation attempts on harmful requests from JailbreakBench with GPT-4 as a jailbreak judge. Interestingly, we also find that reformulations in the future tense are less effective, suggesting that refusal guardrails tend to consider past historical questions more benign than hypothetical future questions. Moreover, our experiments on fine-tuning GPT-3.5 Turbo show that defending against past reformulations is feasible when past tense examples are explicitly included in the fine-tuning data. Overall, our findings highlight that the widely used alignment techniques -- such as SFT, RLHF, and adversarial training -- employed to align the studied models can be brittle and do not always generalize as intended. We provide code and jailbreak artifacts at https://github.com/tml-epfl/llm-past-tense.
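A success rate quoted over multiple attempts depends on how those attempts are aggregated. The sketch below assumes a request counts as jailbroken if at least one of its attempts is judged successful; the abstract does not spell out the aggregation rule, so this is an assumption. The function name and the toy data are hypothetical and are not results from the paper.

```python
from typing import Sequence


def attack_success_rate(judgements: Sequence[Sequence[bool]]) -> float:
    """Fraction of requests with at least one successful attempt.

    judgements[i][j] is True when attempt j on request i was judged a
    successful jailbreak.
    """
    if not judgements:
        raise ValueError("need at least one request")
    successes = sum(1 for attempts in judgements if any(attempts))
    return successes / len(judgements)


# Toy data: 4 requests, 3 attempts each (not results from the paper).
toy = [
    [False, False, True],
    [False, False, False],
    [True, True, False],
    [False, True, False],
]
print(attack_success_rate(toy))  # 0.75
```

Under this counting rule the rate can only stay flat or rise as more attempts are allowed, which is why the number of attempts should be reported alongside the figure.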


SelfIE: Self-Interpretation of Large Language Model Embeddings

arXiv.org Artificial Intelligence

How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model developments. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts in the hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE's text descriptions of hidden embeddings also open up new avenues to control LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while only requiring gradient computation of an individual layer. We extend RLHF to hidden embeddings and propose Reinforcement Control, which erases harmful knowledge in LLMs without supervision targets.


2 charged with hate crimes after black family's home is hit by Molotov cocktails and racist graffiti

Los Angeles Times

In a crime that shocked a California Delta community, a man and woman were charged with hate crimes Tuesday in connection with launching Molotov cocktails into the home of a black family in Antioch and spray-painting the residence with a swastika and racial slurs, police said. Roy Charles Sorvari, 27, of Antioch and Christyne Gail McDaniel, 25, of Brentwood face charges of arson and conspiracy to commit murder, mayhem, torture and assault with a deadly weapon, according to the Antioch Police Department. Sorvari and McDaniel were also charged with hate crime enhancements. They have each been ordered held on more than $1 million bail. The attack "sent shockwaves in the city of Antioch," Police Chief Allan Cantando said at a news conference Tuesday.


Man Arrested for Throwing Molotov Cocktails at Google Street View Car

TIME - Tech

A man has been charged with felony arson after authorities say he threw Molotov cocktails at a Google car parked outside a company building in Mountain View, Calif. Raul Murillo Diaz, 30, threw several beer bottles turned into Molotov cocktails at a Google Street View car parked outside Google's building, prosecutors said, according to NBC Bay Area. While one of the bottles caused a fire, the car did not explode. Diaz told law enforcement he "felt Google was watching him and it made him upset," according to an affidavit. Diaz also told authorities he was involved in two other incidents related to Google, including burning one of Google's self-driving cars, but he has only been charged with one count of arson so far.